import pandas as pd
import numpy as np
import pickle
import sys; sys.path.append('modules')
from Visual import Visual as V
from LearningPipe import LearningPipe as lp
from KDE_naive_bayes import KDENB as nb
from sklearn.linear_model import LogisticRegression as lr
from sklearn.svm import SVC as svc
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.ensemble import RandomForestClassifier as rf
from xgboost import XGBClassifier as xgb
The dataset was cleaned and standardized in the preprocessing phase and saved to npz files, so we load it directly in this notebook.
In this part of the project, I will test 6 classifiers:
1. Naive Bayes
2. Logistic Regression
3. SVM
4. k-Nearest Neighbours
5. Random Forest
6. XGBoost
in 5 different feature spaces:
1. original
2. PCA
3. sparsePCA
4. FactorAnalysis
5. NMF
Before fitting models on the dataset, I use an extra-trees model to select the most informative features and reduce the dimensionality. This lowers both the chance of overfitting and the running time of the algorithms.
All the classifiers above have hyperparameters to tune except naive Bayes. To avoid overfitting, I will test each hyperparameter combination on the training set using 5-fold cross-validation, and evaluate the performance using the function explained below.
As explained in great detail in "Pattern Recognition and Machine Learning" (Bishop, 2006), the expected error can be decomposed as:
$$Error = Bias^2 + Variance + Noise$$
Here the squared bias is the distance from the expectation of the prediction to the target value. In the case of classification, the distance from the target value to a correct prediction is 0 and to a wrong prediction is 1. Since we are using 5-fold cross-validation, $bias = mean(1 - accuracy) = 1 - mean(accuracy)$, and the variance is $var(accuracy)$.
*Because the noise term describes noise that comes with the dataset and is not reducible by tuning the model, we do not consider it in the performance evaluation.
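For concreteness, the evaluation criterion can be sketched as follows; this `expected_loss` helper is an illustrative stand-in, not the actual implementation inside `LearningPipe`:

```python
import numpy as np

def expected_loss(cv_accuracies):
    """Squared bias plus variance over k-fold CV accuracies;
    the irreducible noise term is left out."""
    acc = np.asarray(cv_accuracies, dtype=float)
    bias = 1.0 - acc.mean()          # mean misclassification rate over folds
    return bias ** 2 + acc.var()     # variance of accuracy across folds

# 5-fold accuracies from a hypothetical grid-search run
print(expected_loss([0.66, 0.64, 0.68, 0.65, 0.67]))   # ≈ 0.1158
```

A model family is then ranked by this single number, trading mean accuracy against fold-to-fold stability.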
np.random.seed(seed=666)
clfNames = [
'Naive Bayes', 'Logistic Regression', 'SVM',
'k-Nearest Neighbours', 'Random Forest', 'XGBoost'
]
spaceNames = [
'original', 'PCA', 'sparsePCA', 'factorAnalysis', 'NMF'
]
dimPairs = [[0,1], [1,2], [2,3], [3,4]]
X_train_space = np.load("data/clean_dataset/X_train_space.npz")
X_test_space = np.load("data/clean_dataset/X_test_space.npz")
y_train = np.fromfile('data/clean_dataset/y_train', dtype=int)
y_test = np.fromfile('data/clean_dataset/y_test', dtype=int)
# feature selection
Xs_train = {}; Xs_test = {}
for space in spaceNames:
Xs_train[space], Xs_test[space] = lp.featureSelection(
X_train_space[space], X_test_space[space], y_train
)
print("Space: {0:17} n_features: {1}"\
.format(space, Xs_train[space].shape[1]))
We start with the simple generative model, naive Bayes. As we can see in the feature histograms below, each feature has a slightly different distribution in the 2 classes, so it makes sense to try naive Bayes.
But these features follow more than one type of distribution: some are normal, some are exponential, and some are atypical and multimodal. So we can't directly use Gaussian naive Bayes or Bernoulli naive Bayes. To accommodate the various types of distributions, I implemented a naive Bayes classifier using Gaussian KDE as the distribution estimator.
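The core idea can be sketched as below, assuming one Gaussian KDE per feature per class; the actual `KDENB` in `modules/KDE_naive_bayes.py` may differ in its details:

```python
import numpy as np
from scipy.stats import gaussian_kde

class SimpleKDENB:
    """Illustrative naive Bayes with one Gaussian KDE per feature per class."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.log_priors_ = {c: np.log(np.mean(y == c)) for c in self.classes_}
        # naive independence assumption: estimate each feature's density separately
        self.kdes_ = {c: [gaussian_kde(X[y == c, j]) for j in range(X.shape[1])]
                      for c in self.classes_}
        return self

    def predict(self, X):
        # log prior plus the sum of per-feature log densities, per class
        scores = np.column_stack([
            self.log_priors_[c]
            + sum(np.log(kde(X[:, j]) + 1e-12)
                  for j, kde in enumerate(self.kdes_[c]))
            for c in self.classes_
        ])
        return self.classes_[np.argmax(scores, axis=1)]
```

Because the KDE adapts to whatever shape each feature's class-conditional density takes, the same classifier handles normal, exponential, and multimodal features alike.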
*Due to the large amount of time needed for grid searching, I will directly import the results generated beforehand.
# v = V(5, 4, figsize=(18,10))
# v.plotHists(Xs_train, y_train, [0,1,2,3], spaceNames)

# models_nb = lp(nb, 'NaiveBayes', Xs_train, y_train)
# models_nb.gridSearching()
with open('data/performances/models_nb.pickle', 'rb') as f:
models_nb = pickle.load(f)
models_nb.results
As the grid-search results show, although the naive Bayes classifier performs a little better in the sparsePCA space than in the other spaces, the accuracy is still quite low.
In general, naive Bayes is a relatively simple generative model. It factorizes the joint distribution into a product of distributions over single random variables, so it cannot capture the relationships between variables, and its decision boundaries are always parallel to the axes.
As we can see in the scatter and decision boundary plots, the data points of the 2 classes are mixed together. Without new features, we can't make good predictions.
models_nb.plotDecisionContours(dimPairs, 4)
Next we move on to the simple discriminative classifier, logistic regression. We will test how this low-capacity model works on our dataset and then go on to higher-capacity model families.
From the scatter plots in the feature engineering section we can tell that the data points are not linearly separable in any of the 5 spaces. In order to draw non-linear decision boundaries and incorporate possible relationships between pairs of features, we can map the features into a higher-dimensional polynomial space. In this project, we set the degree to 2. (We will test degree 3 with the polynomial kernel in the support vector machine.)
Xs_train_poly = {}; Xs_test_poly = {}
for space in spaceNames:
Xs_train_poly[space], Xs_test_poly[space] = lp.polynomializeFeatures(
Xs_train[space], Xs_test[space], n_degree=2
)
Grid search is performed on parameters:
1. Regularization type (penalty): ['l1', 'l2']
2. Regularization coefficient (C): [1, 10, 100, 1000, 2000]
3. Maximum iterations (max_iter): [100, 200, 500]
params_lr = {
'penalty':['l1', 'l2'],
'C':[1, 10, 100, 1000, 2000],
'max_iter':[100, 200, 500]
}
# models_lr = lp(lr, 'LogisticRegression', Xs_train_poly, y_train)
# models_lr.gridSearching(params=params_lr)
with open('data/performances/models_lr.pickle', 'rb') as f:
models_lr = pickle.load(f)
The performances are much better than those of the naive Bayes classifiers: the mean accuracies are much higher and the variances are relatively low.
models_lr.results.head(n=10)
For each of the hyperparameters:
replaceCol = {'max_iter':'m_iter', 'penalty':'reg'}
models_lr.plotParamGrid(params_lr, replace=replaceCol)
models_lr.plotDecisionContours(dimPairs, 4)
models_lr.compareDecisionContour('factorAnalysis', [3,4], alpha=0.3)
In general, a linear classifier like logistic regression is too simple to achieve high accuracy, even if we map the features into a quadratic polynomial space. Consequently, tuning the hyperparameters doesn't change the behavior of the classifier much.
If we map the features into an even higher-dimensional space, the computation becomes expensive. This leads us to the support vector machine, which handles higher-dimensional spaces more efficiently using the kernel trick.
The main problem with the previous 2 classifiers is the features: they are not predictive enough. When we can't find more features, we can try combining the existing ones in mathematical ways, hoping that some combinations separate the classes well. The SVM helps us do this with the kernel trick.
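A small illustration of why the kernel trick works, using a hypothetical explicit map `phi` for 2-D inputs: the homogeneous degree-2 polynomial kernel equals an inner product in a quadratic feature space that is never built explicitly.

```python
import numpy as np

# Explicit degree-2 feature map for 2-D inputs corresponding to K(x, z) = (x . z)^2
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = (x @ z) ** 2        # computed in the original 2-D space
explicit_value = phi(x) @ phi(z)   # computed in the 3-D feature space
print(kernel_value, explicit_value)   # 16.0 16.0 (they agree)
```

An SVM only ever needs these kernel values, so it can work with feature spaces whose dimension grows combinatorially (or is infinite, as for the rbf kernel) at no extra cost.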
Grid search is performed on parameters:
1. Kernel type (kernel): ['poly' (degree=3), 'rbf']
2. Error penalty (C): [10**1, 10**1.5, 10**2.0, 10**2.5, 10**3.5]
3. Kernel coefficient (gamma): [10**-3.0, 10**-2.0, 10**-1.5, 10**-1.0]
params_svc = {
'kernel':['poly', 'rbf'],
'C':(10**1, 10**1.5, 10**2.0, 10**2.5, 10**3.5),
'gamma':(10**-3.0, 10**-2.0, 10**-1.5, 10**-1.0),
'probability':[True]
}
# models_svc = lp(svc, 'SVM', Xs_train, y_train)
# models_svc.gridSearching(params=params_svc)
# models_svc.results = models_svc.results.drop('probability', axis=1)
with open('data/performances/models_svc.pickle', 'rb') as f:
models_svc = pickle.load(f)
Looking at the 10 best models, we can see the performance is no better than logistic regression, but the variances in the sparsePCA feature space are all low.
models_svc.results.head(n=10)
For each of the hyperparameters:
models_svc.plotParamGrid(params_svc)
But the decision boundary plots show these classifiers are almost useless: most of the SVM models label most of the data points as unpopular. We will investigate this issue further by looking at positive and negative predictions in the model selection section.
models_svc.plotDecisionContours(dimPairs, 4)
models_svc.compareDecisionContour('factorAnalysis', [3,4], alpha=0.3)
So far we have tested 3 classifiers, and the best ones are SVM (questionably) and logistic regression. But we still hope to find models that predict better. What these 3 classifiers have in common is that they are all feature-based: we have been manipulating features in different ways to make predictions. Maybe this approach has a limit on our dataset. So this time we will try k-NN, which is instance-based: every prediction is made by checking surrounding data points whose labels are already known. This model has very high representational capacity.
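The instance-based idea, and the `weights` option we grid-search below, can be demonstrated on toy data (not our dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two toy clusters; every prediction comes straight from the labelled neighbours
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(2.0, 0.5, (30, 2))])
y_toy = np.array([0] * 30 + [1] * 30)

# 'uniform' counts all k neighbours equally; 'distance' lets closer ones dominate
for weights in ('uniform', 'distance'):
    clf = KNeighborsClassifier(n_neighbors=5, weights=weights).fit(X_toy, y_toy)
    print(weights, clf.predict([[0.1, 0.0], [2.0, 1.9]]))
```

No parametric decision function is ever fit: the training set itself is the model, which is what gives k-NN its flexible, locally adaptive boundaries.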
Grid search is performed on parameters:
1. Number of neighbors: [20, 40, 80, 160]
2. Weighting methods: ['uniform', 'distance']
3. Distance metric: ['manhattan', 'euclidean', 'minkowski']
params_knn = {
'n_neighbors':[20, 40, 80, 160],
'weights':['uniform', 'distance'],
'metric':['manhattan', 'euclidean', 'minkowski']
}
# models_knn = lp(knn, 'k-NearestNeighbors', Xs_train, y_train)
# models_knn.gridSearching(params=params_knn)
with open('data/performances/models_knn.pickle', 'rb') as f:
models_knn = pickle.load(f)
Among the models in this family, the pattern of performance is quite different from the previous models. The accuracies are not as high as those of the SVM models in general, but the variances are all low, so the expected losses are low.
models_knn.results.head(n=10)
For each of the hyperparameters:
replaceCol = {'n_neighbors':'n_ngbs'}
models_knn.plotParamGrid(params_knn, replace=replaceCol)
In general, benefiting from their high capacity, the k-NN models are able to separate the mixed data points better than the other models. This matches the decision boundary plots below: the boundaries are rough and thus capture finer patterns.
models_knn.plotDecisionContours(dimPairs, 4)
By comparing the models before and after hyperparameter tuning, we can see the decision boundaries are smoothed, mainly due to the increased number of neighbors. Thus the models are less prone to overfitting.
models_knn.compareDecisionContour('sparsePCA', [3,4], alpha=0.3)
After trying both feature-based and instance-based classifiers, the performance is still not satisfying. We will try ensemble methods, which combine the decisions of multiple classifiers. The remaining 2 classifiers we are testing, random forest and extreme gradient boosting (XGBoost), are both tree-based ensemble classifiers. The former, the random forest classifier, simply combines decisions by averaging the predictions of an assigned number of tree classifiers. The latter, extreme gradient boosting, focuses more on the data points that are difficult to classify.
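The averaging idea can be sketched by hand on toy data; this is a simplified forest, since a real random forest also subsamples features at each split:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Hand-rolled "forest": average the class-1 probabilities of trees grown on
# bootstrap resamples, then threshold at 0.5
X_toy, y_toy = make_moons(n_samples=200, noise=0.3, random_state=0)
rng = np.random.default_rng(0)
probas = []
for _ in range(50):
    idx = rng.integers(0, len(X_toy), len(X_toy))            # bootstrap sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx])
    probas.append(tree.predict_proba(X_toy)[:, 1])
forest_pred = (np.mean(probas, axis=0) > 0.5).astype(int)
print("training accuracy:", (forest_pred == y_toy).mean())
```

Each individual tree overfits its bootstrap sample, but averaging many decorrelated trees cancels out much of that variance.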
Grid search is performed on parameters:
1. Number of trees (n_estimators): [100, 200, 400, 800, 1600]
2. Maximal depth of each tree (max_depth): [5, 10, 20, 30]
3. Minimal sample size to split a node (min_samples_split): [2, 5, 10, 20, 40]
params_rf = {
'n_estimators':(100, 200, 400, 800, 1600),
'max_depth':(5, 10, 20, 30),
'min_samples_split':(2, 5, 10, 20, 40)
}
# models_rf = lp(rf, 'RandomForest', Xs_train, y_train)
# models_rf.gridSearching(params=params_rf)
with open('data/performances/models_rf.pickle', 'rb') as f:
models_rf = pickle.load(f)
Investigating the performances, the random forest models are indeed doing quite well. The accuracies are all over 66%, so the expected losses are lower, although the variances are higher than those of the other model families.
models_rf.results.head(n=10)
For each of the hyperparameters:
replaceCol = {'max_depth':'depth', 'min_samples_split':'leaf', 'n_estimators':'tree'}
models_rf.plotParamGrid(params_rf, replace=replaceCol)
Similar to naive Bayes, tree classifiers split on one feature at a time, so the decision boundaries are parallel to the axes. But because a random forest combines the decisions of many trees, the model is much more flexible than naive Bayes.
Compared with the default model, we have many more trees in the forest and the leaf sizes are much bigger, so the decision boundaries are much smoother.
models_rf.plotDecisionContours(dimPairs, 4)
models_rf.compareDecisionContour('original', [2,3], alpha=0.3)
Extreme gradient boosting (XGBoost) is a popular ensemble model. Starting from simple classifiers, it keeps adding new tree classifiers fitted to the errors of the current ensemble.
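A minimal sketch of that boosting loop on toy data, using squared loss and plain sklearn trees for simplicity (XGBoost itself optimizes the logistic loss with regularization):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeRegressor

# Each shallow tree is fit to the residuals of the current ensemble,
# so it concentrates on the points the ensemble still gets wrong
X_toy, y_toy = make_moons(n_samples=200, noise=0.3, random_state=0)
pred = np.full(len(y_toy), y_toy.mean())        # start from a constant model
for _ in range(50):
    residuals = y_toy - pred                    # largest where we are most wrong
    stump = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    pred += 0.3 * stump.predict(X_toy)          # shrink each tree's contribution
boost_pred = (pred > 0.5).astype(int)
print("training accuracy:", (boost_pred == y_toy).mean())
```

Unlike the forest's independent trees, these trees are built sequentially, each correcting its predecessors.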
Grid search is performed on parameters:
1. Subsample ratio of the training instances for each fitting round (subsample): [0.2, 0.5, 0.8, 1]
2. Minimum loss reduction required to make a split (gamma): [0, 0.2, 0.5, 1]
3. Coefficient of L1 regularization (reg_alpha): [0.001, 0.01, 0.1, 1]
params_xgb = {
'subsample':[0.2, 0.5, 0.8, 1],
'gamma':[0, 0.2, 0.5, 1],
'reg_alpha':[0.001, 0.01, 0.1, 1],
'silent':[1],
'nthread':[4],
'objective':['binary:logistic']
}
# models_xgb = lp(xgb, 'ExtraGradientBoost', Xs_train, y_train)
# models_xgb.gridSearching(params=params_xgb)
# models_xgb.results = models_xgb.results.drop(['nthread', 'objective', 'silent'], axis=1)
with open('data/performances/models_xgb.pickle', 'rb') as f:
models_xgb = pickle.load(f)
As shown in the performance evaluations of the extreme gradient boosting classifiers, the accuracies are even higher than those of the random forest classifiers, but the variances are also high. High variance seems to be a common problem for ensemble methods on this dataset.
models_xgb.results.head(n=10)
For each of the hyperparameters:
replaceCol = {'subsample':'bag', 'reg_alpha':'reg_l1'}
models_xgb.plotParamGrid(params_xgb, replace=replaceCol)
models_xgb.plotDecisionContours(dimPairs, 4)
models_xgb.compareDecisionContour('original', [2,3], alpha=0.3)
Taking the model with the best hyperparameter setting from each family in each feature space, we can plot them and compare their performances visually.
models = [
models_nb, models_lr, models_svc, models_knn, models_rf, models_xgb
]
performances = []
for model in models: performances.append(model.bestPerfs)
performances = pd.concat(performances, axis=0).reset_index()
# mean accuracy, variance and expected loss are on different scales,
# so we standardize them before showing them in the line plot
# standardize = lambda sr: (sr - sr.mean()) / (sr.max() - sr.min())
# visuallize loss var and bias
# v = V(1, 1, figsize=(15, 8))
# v.plotLines(
# performances.index,
# [standardize(performances['expectedLoss']),
# standardize(performances['mean']),
# standardize(performances['variance'])]
# )
Ideally, we should select models with high mean accuracy and low variance. Judging from the plot, we can easily tell that naive Bayes is not a good choice for us. Although their variances are not steady, the random forest and extreme gradient boosting models have the lowest expected losses.
The 12 best models are listed below.
bestPerfs = performances.sort_values('expectedLoss')\
.reset_index().head(n=12)
bestPerfs
Remember that the SVM classifiers have an issue of labeling most data points as one class. To avoid selecting models with this kind of behavior, we need to investigate the confusion matrix. Here we use confusion histograms, a detailed version of the confusion matrix, which show us how confident these 12 classifiers are when making predictions and how many predictions are correct or wrong.
# v = V(3, 4, figsize=(18, 12))
# v.plotConfusionHists(
# bestPerfs['confusionHist'],
# bestPerfs['clf'],
# bestPerfs['space']
# )

These confusion histograms assure us that none of these classifiers makes unbalanced predictions. And when these classifiers make confident correct predictions, they also make confident mistakes. None of them is obviously better than the others, so we may choose the one with the lowest expected loss, which is extreme gradient boosting in the original space, as our final model.